
fix: batch pending parent root fetches to avoid 300+ sequential round-trips#694

Closed
ch4r10t33r wants to merge 7 commits into `main` from `fix/checkpoint-sync-chunked-body`

Conversation

@ch4r10t33r
Contributor

Summary

  • Adds a `pending_parent_roots` queue to `BeamNode`
  • `cacheBlockAndFetchParent` now enqueues the missing parent root instead of immediately firing an individual `blocks_by_root` request
  • New `flushPendingParentFetches` drains the map and sends one batched request for all accumulated roots
  • The flush is called at every natural processing exit point (the missing-parent early return, `handleGossipProcessingResult`, and `processBlockByRootChunk`)

Problem

When a syncing peer walks a long parent chain (e.g. 300 blocks), every missing parent triggered its own `blocks_by_root` request on a fresh libp2p stream. This flooded both sides with 300+ individual round-trips instead of a single batched request.

Effect

When multiple gossip blocks arrive in the same burst sharing the same missing ancestor, all of their parent roots are collected into one map and sent as a single request. In the sequential case (each block reveals the next missing parent), the map also deduplicates repeat fetches of the same root.
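The queue-and-flush shape described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `pending_parent_roots`, `cacheBlockAndFetchParent`, `flushPendingParentFetches`, and `fetchBlockByRoots` names come from the PR text, while the `Root` type, the depth value type, and all surrounding plumbing are assumptions.

```zig
const std = @import("std");

const Root = [32]u8;

/// Minimal sketch of the deferred parent-fetch queue; everything not
/// named in the PR summary is illustrative.
const BeamNode = struct {
    allocator: std.mem.Allocator,
    /// Missing parent roots accumulated during one processing pass,
    /// keyed by root so the same ancestor is only requested once.
    pending_parent_roots: std.AutoHashMap(Root, usize),

    fn cacheBlockAndFetchParent(self: *BeamNode, parent_root: Root, depth: usize) !void {
        // Enqueue instead of firing an individual blocks_by_root request.
        try self.pending_parent_roots.put(parent_root, depth);
    }

    fn flushPendingParentFetches(self: *BeamNode) !void {
        if (self.pending_parent_roots.count() == 0) return;
        const roots = try self.allocator.alloc(Root, self.pending_parent_roots.count());
        defer self.allocator.free(roots);
        var i: usize = 0;
        var it = self.pending_parent_roots.keyIterator();
        while (it.next()) |root| : (i += 1) roots[i] = root.*;
        self.pending_parent_roots.clearRetainingCapacity();
        // One batched blocks_by_root request for every accumulated root.
        try self.fetchBlockByRoots(roots);
    }

    // Stands in for the real network request; elided in this sketch.
    fn fetchBlockByRoots(self: *BeamNode, roots: []const Root) !void {
        _ = self;
        _ = roots;
    }
};
```

Because the hash map is keyed by root, a burst of N blocks sharing one missing ancestor collapses to a single entry before the flush, which is exactly the dedup behaviour the Effect section describes.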

Test plan

  • `zig build test --summary all` passes

ch4r10t33r and others added 6 commits March 23, 2026 15:25
When the SSZ state grows beyond ~3 MB the server switches from sending
a Content-Length response to Transfer-Encoding: chunked. The previous
body-reading loop called readSliceShort which internally goes through:

  readSliceShort → readVec → defaultReadVec → contentLengthStream

contentLengthStream accesses reader.state.body_remaining_content_length
but that field is not active for chunked responses (state is 'ready'),
causing a panic:

  thread 1 panic: access of union field 'body_remaining_content_length'
  while field 'ready' is active

Replace the manual request/response loop with client.fetch() using a
std.Io.Writer.Allocating as the response_writer. fetch() calls
response.readerDecompressing() + streamRemaining() which dispatches
through chunkedStream or contentLengthStream correctly based on the
actual transfer encoding used by the server.
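The replacement described above can be sketched with the Zig 0.15-era `std.http` API. The URL, allocator wiring, and status handling here are illustrative assumptions; only the `fetch()` + `std.Io.Writer.Allocating` combination comes from the commit message.

```zig
const std = @import("std");

/// Sketch: download a body of unknown transfer encoding into memory.
/// fetch() dispatches through chunkedStream or contentLengthStream based
/// on what the server actually sent, avoiding the readSliceShort panic.
fn downloadState(allocator: std.mem.Allocator, url: []const u8) ![]u8 {
    var client: std.http.Client = .{ .allocator = allocator };
    defer client.deinit();

    // Accumulate the response body into an allocator-backed writer.
    var body: std.Io.Writer.Allocating = .init(allocator);
    errdefer body.deinit();

    const result = try client.fetch(.{
        .location = .{ .url = url },
        .response_writer = &body.writer,
    });
    if (result.status != .ok) return error.UnexpectedStatus;

    return body.toOwnedSlice();
}
```

The key point is that no code here touches `body_remaining_content_length` directly; the decoding path is chosen inside `fetch()` from the response headers.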
After checkpoint sync, incoming blocks may carry attestations whose target
slots predate the finalized anchor. IsJustifiableSlot correctly identifies
these as non-justifiable, but the `try` was propagating the error fatally,
causing the entire block import to fail.

This creates a cascading gap: block N fails → blocks N+1..M fail (missing
parent) → no epoch-boundary attestations accumulate → justified checkpoint
never advances → forkchoice stays stuck in `initing` indefinitely.

Fix: catch InvalidJustifiableSlot and treat it as `false`. The attestation
is then silently skipped via the existing !is_target_justifiable check,
exactly as all other non-viable attestations (unknown source/target/head,
stale slot, etc.) are handled. The block imports successfully, the chain
catches up, and the node exits the initing state.

Update the test that was asserting the old (buggy) error-propagation
behaviour to instead assert that process_attestations succeeds.
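The catch-and-downgrade described above can be sketched as below. The `IsJustifiableSlot` and `InvalidJustifiableSlot` names come from the commit message; the argument names and the surrounding attestation loop are assumptions.

```zig
// Inside the per-attestation processing loop (context assumed):
const is_target_justifiable = IsJustifiableSlot(anchor_slot, target_slot) catch |err| switch (err) {
    // Pre-anchor target slots are simply "not justifiable", not fatal.
    error.InvalidJustifiableSlot => false,
    else => return err,
};
if (!is_target_justifiable) {
    // Skipped exactly like other non-viable attestations
    // (unknown source/target/head, stale slot, ...).
    continue;
}
```

Only this one error value is downgraded; any other failure still propagates, so genuine bugs are not silently swallowed.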
After checkpoint sync the forkchoice starts in the initing state and
waits for a first justified checkpoint before declaring itself ready.
The status-response sync handler was checking getSyncStatus() and
treating fc_initing the same as synced — doing nothing.  This created
a deadlock: the node never requested blocks from ahead peers because
it was in fc_initing, and it could never leave fc_initing because no
blocks were imported.

Fix the deadlock in two places:

1. Status-response handler: add an explicit fc_initing branch that
   requests the peer's head block when the peer is ahead of our anchor
   slot.  This mirrors the behind_peers branch but uses head_slot for
   the comparison (finalized_slot is not yet meaningful in fc_initing).

2. Periodic sync refresh: every SYNC_STATUS_REFRESH_INTERVAL_SLOTS (8)
   slots, re-send our status to all connected peers when not synced.
   This recovers from the case where all peers were already connected
   before the fix was deployed, so no new connection event fires and
   the status-response handler would never be re-triggered.
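The first half of the fix can be sketched as a new branch in the status-response handler's switch. `fc_initing`, `behind_peers`, `head_slot`, and `getSyncStatus` come from the commit message; the peer-status fields and `fetchBlockByRoots` call shown here are assumptions about the surrounding code.

```zig
// Sketch of the status-response handler after the fix (context assumed):
switch (self.getSyncStatus()) {
    .fc_initing => {
        // finalized_slot is not yet meaningful in fc_initing, so compare
        // head slots: if the peer is ahead of our checkpoint anchor,
        // request its head block so imports can start and forkchoice can
        // eventually leave fc_initing.
        if (peer_status.head_slot > self.anchor_slot) {
            try self.fetchBlockByRoots(&.{peer_status.head_root});
        }
    },
    .behind_peers => {
        // pre-existing branch, unchanged: compares finalized slots
    },
    .synced => {},
}
```

The periodic refresh (point 2) then guarantees this handler keeps firing every `SYNC_STATUS_REFRESH_INTERVAL_SLOTS` slots while unsynced, even when no new connection event arrives.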
When a block arrives with a missing parent, the old code immediately sent
an individual blocks_by_root request for that single parent root.  A syncing
peer walking a long parent chain (e.g. 300 slots back) would therefore open
300+ separate libp2p streams - one per ancestor - flooding both sides with
individual round-trips.

Replace the immediate fire-and-forget with a deferred queue:

- Add `pending_parent_roots: AutoHashMap(Root, depth)` to BeamNode.
- `cacheBlockAndFetchParent` now enqueues the missing parent root instead of
  calling `fetchBlockByRoots` directly.
- `flushPendingParentFetches` drains the map and issues a single batched
  blocks_by_root request for all accumulated roots.
- The flush is called at every natural exit point: after the missing-parent
  early-return in `onGossip`, at the end of `handleGossipProcessingResult`,
  and at the end of `processBlockByRootChunk`.

When multiple gossip blocks arrive in the same burst with the same missing
ancestor, all their parent roots are now collected and sent as one request
instead of N separate requests.
@ch4r10t33r ch4r10t33r changed the title node: batch pending parent root fetches to avoid 300+ sequential round-trips fix: batch pending parent root fetches to avoid 300+ sequential round-trips Mar 25, 2026
@ch4r10t33r
Contributor Author

Closing in favour of a clean PR with only the relevant commits.

@ch4r10t33r ch4r10t33r closed this Mar 25, 2026